Sparse and skew hashing of K-mers

Abstract Motivation A dictionary of k-mers is a data structure that stores a set of n distinct k-mers and supports membership queries. This data structure is at the hearth of many important tasks in computational biology. High-throughput sequencing of DNA can produce very large k-mer sets, in the size of billions of strings—in such cases, the memory consumption and query efficiency of the data structure is a concrete challenge. Results To tackle this problem, we describe a compressed and associative dictionary for k-mers, that is: a data structure where strings are represented in compact form and each of them is associated to a unique integer identifier in the range [0,n). We show that some statistical properties of k-mer minimizers can be exploited by minimal perfect hashing to substantially improve the space/time trade-off of the dictionary compared to the best-known solutions. Availability and implementation https://github.com/jermp/sshash. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
A k-mer is a string of length k over the DNA alphabet fA; C; G; Tg. Software tools based on k-mers are in widespread use in Bioinformatics. Many large-scale analyses of DNA share the elementary need of determining the exact membership of k-mers to a given set S, i.e. they rely on the space/time efficiency of a dictionary data structure for k-mers (Chikhi et al., 2021). This work proposes such an efficient dictionary. More precisely, the problem we study here is defined as follows. Given a large string over the DNA alphabet (e.g. a genome or a pan-genome), let S be the set of all its distinct k-mers, with jSj ¼ n. A dictionary for S is a data structure that supports the following two operations: • for any k-mer g, LookupðgÞ returns a unique integer 0 i < n if g 2 S (i is the 'identifier' of g in S) or i ¼ -1 if g 6 2 S; • for any 0 i < n; AccessðiÞ extracts the k-mer g for which LookupðgÞ ¼ i.
By means of the Lookup query, the dictionary is able to answer membership queries in an exact way (rather than approximate) and to associate satellite information to k-mers (such as abundances). Thanks to the Access query, the original set S can be reconstructed, meaning that the dictionary is a self-index for S.
In sequence analysis tasks, it is very often the case that we are given a pattern P of length jPj ! k and we are interested in answering membership to S for all the k-mers read consecutively from P, i.e. for P½i; i þ kÞ; i ¼ 0; . . . ; jPj À k. For example, we may decide that the whole pattern P is present in a genome if the number of kmers of P that belong to S is at least h Á ðjPj À k þ 1Þ, for a prescribed coverage threshold h > 0, such as h ¼ 0:8 (Bingmann et al., 2019;Solomon and Kingsford, 2016). In other words, Lookup queries are often issued for consecutive k-mers (one being the previous shifted to the right by one symbol) (Robidou and Peterlongo, 2021). While it is obviously possible to perform jPj À k þ 1 Lookup queries for a pattern of length jPj, it also seems profitable to answer 'Is P½i; i þ kÞ a member of S?' more efficiently knowing that the previous k-mer shares kÀ1 symbols with P½i; i þ kÞ. We regard this latter scenario as that of streaming queries.
Therefore, our objective is to support Lookup, Access and streaming membership queries as efficiently as possible in compressed space. (The data structure is static: insertions/deletions of kmers are not supported.) As a first introductory remark we shall mention that the algorithmic literature about the so-called compressed string dictionary problem is rich of solutions, e.g. based on Front-Coding, tries, hashing or combinations of such techniques (see the survey by Mart ınez-Prieto et al. (2016)). However, these solutions are unlikely to be competitive for the specialized version of the problem we tackle here because they are relevant for 'generic' strings that usually: (i) have variable length; (ii) are drawn from larger alphabets (e.g. ASCII); and (iii) do not exhibit particular properties that can aid compression. Instead, k-mers are fixed-length strings; their alphabet of representation is very small (just 2 bits per alphabet symbol are sufficient); and since k-mers are extracted consecutively from DNA, two consecutive strings overlap by kÀ1 symbols that are redundant and should not be represented twice in the dictionary. This motivates the study of specialized solutions for k-mers.
These properties are elegantly captured by the de Bruijn graph representation of S-a graph whose nodes are the k-mers in S and the edges model the string overlaps between the k-mers. Using this formalism, it is possible to reduce the redundancy of the symbols in S by considering paths in the graph and their corresponding strings. We will better formalize this point in Section 2.
For the scope of this work, it is sufficient to point out that: (i) many algorithms have been proposed to build de Bruijn graphs (dBGs) efficiently (Chikhi et al., 2016;Khan et al., 2021;Khan and Patro, 2021) from which these paths can be extracted for indexing purposes; (ii) not surprisingly, essentially all state-of-the-art dictionaries for k-mers-that we briefly review in Section 3-are based on the principle of indexing such collections of paths (Almodaresi et al., 2018;Chikhi et al., 2014;Marchet et al., 2021;Rahman and Medvedev, 2020). We also follow this direction. However, we note that existing dictionary data structures either represent such paths with an FM-index (Ferragina and Manzini, 2000) (or one of its many variants), hence retain highly compressed space but very slow query time in practice or, vice versa, resort on hashing for fast evaluation but take much more space (Almodaresi et al., 2018;Marchet et al., 2021). It is, therefore, desirable to have a good balance between these two extremes.
For this reason, we show that we can still enjoy the query efficiency of hashing while taking small space-significantly less space than prior schemes based on (minimal and perfect) hashing. More specifically, we show how two statistical properties of k-mer minimizers-precisely, those of being sparse and skewly distributed in DNA sequences-can be better exploited to derive an efficient dictionary based on minimal perfect hashing and compact encodings (Section 4). We evaluate the proposed data structure over sets of billions of k-mers, under different query distributions and modalities, and exhibit a substantial performance improvement compared to prior solutions for the problem (Section 5). Our Cþþ implementation of the dictionary is available at https://github.com/jermp/ sshash.

Preliminaries
In this section, we give some preliminary remarks to better support the exposition in Sections 3 and 4. Let S be the collection of the n distinct k-mers extracted from a given, large, string (or set of strings). This string can be, e.g. the genome of an organism.
Throughout the article, we consider to be identical two k-mers that are the reverse complement of each other.
Path covering the de Bruijn graph. A dBG for S is a directed graph GðSÞ where the set of nodes is S and a direct edge from node u to v exists if and only if the last k-1 symbols of u are equal to the first kÀ1 symbols of v, i.e. u½1; k À 1 ¼ v½0; k À 2. It follows that a path traversing ' nodes corresponds to a string of length ' þ k À 1 spelled out by the path, obtained by 'glueing' all the nodes' k-mers in order.
A disjoint-node path cover for GðSÞ is a set of paths where each node belongs to exactly one path, e.g. a set of unitigs, maximal unitigs, or maximal stitched unitigs (Rahman and Medvedev, 2020) (also known as simplitigs (B rinda et al., 2021)). We denote such a cover with S 0 .
The strings in S 0 form the natural basis for a space-efficient dictionary because: (i) by considering paths in the graph, the number of symbols in S 0 is less than the number of symbols in the original S and, (ii) by being a disjoint-node path cover we are guaranteed that there are no duplicate k-mers in S 0 . Therefore, we assume from now on that a path cover S 0 has been computed for GðSÞ as the input for our problem.
Minimizers and super-k-mers. Given a k-mer g, an integer m k, and a total order relation R on all m-length strings, the smallest m-mer of g according to R is called the minimizer of g. R could be, e.g. the simple lexicographic order. Instead, here we use the random order given by a hash function h, chosen from a universal family. Therefore, simply put, the minimizer of g is the m-mer of g that minimizes the value of h (sometimes called a 'random' minimizer).
Minimizers are very popular in sequence analysis, such as for seedand-extend algorithms, because of the following empirical property: consecutive k-mers tend to have the same minimizer (Roberts et al., 2004;Schleimer et al., 2003). This means that there are far less distinct minimizers than k-mers-approximately, ðk À m þ 2Þ=2 times less minimizers than k-mers (independently of the sequence length), if m is not very small compared to k (more precisely, see Zheng et al. (2020, Theorem 3)). For example, if k ¼ 31 and m ¼ 20, we should expect to see %6:5Â less minimizers than k-mers.
Given a string S of length at least k (e.g. a path in a dBG), we call a super-k-mer of S a maximal sequence of consecutive k-mers having the same minimizer (Li et al., 2013).
Minimal perfect hashing. Given a set X of n distinct keys, a function f that maps bijectively the keys into the integer range f0; . . . ; n À 1g is called a minimal perfect hash function (MPHF) for the set X. The function is allowed to return an arbitrary value in ½0; nÞ for any key that does not belong to X, hence it can be realized in small space, in practice 2-3 bits/key (albeit log 2 e % 1:44 bits/key are sufficient in theory (Mehlhorn, 1982)). Many efficient algorithms have been proposed to build MPHFs from static sets that scale well to large values of n and retain practically-constant evaluation time. In this paper, we use PTHash (Pibiri and Trani, 2021a,b) for its very fast evaluation time, usually 2 À 4Â better than other techniques, and good space effectiveness.
Elias-Fano encoding. Given a monotone integer sequence S½0::nÞ whose largest (last) element is less than or equal to a known quantity U, the Elias-Fano encoding represents S in at most nd log 2 ðU=nÞe þ 2n bits (Elias, 1974;Fano, 1971). With o(n) extra bits it is possible to decode any S½i in constant time and support successor queries in OðlogðU=nÞÞ time. We point the interested reader to the survey by Pibiri and Venturini (2021, Section 3.4) for a complete description and discussion of the encoding.

Related work
As anticipated in Section 1, most existing solutions for exact membership queries are based on indexing paths of the dBG (see also Section 2), such as its unitigs or maximal (possibly, stitched) unitigs. These approaches have also been summarized in the recent survey by Chikhi et al. (2021, Section 4.2), hence we give a rather cursory overview here.
The paths can be represented using an FM-index (Ferragina and Manzini, 2000) for very compact space (Chikhi et al., 2014;Rahman and Medvedev, 2020). The practical efficiency of the FMindex mainly depends on how many samples of the suffix-array are kept in the index.
Other approaches resort on hashing for fast lookup queries. For example, Bifrost (Holley and Melsted, 2020) uses a hash table of minimizers whose values are the locations of the minimizers in the unitigs. The index was designed to be dynamic, hence allowing insertion/removal of k-mers and consequent re-computation of the unitigs. The dynamic nature of Bifrost makes it consume higher space compared to static approaches using compressed hash representations and succinct data structures, like Pufferfish (Almodaresi et al., 2018) and Blight (Marchet et al., 2021). Hence, it is regarded as out of scope for this work.
Pufferfish (Almodaresi et al., 2018) associates to each k-mer its location in the unitigs using an MPHF and a vector of absolute positions. The authors also propose a sparse version of the index where the vector of positions is sampled to improve space usage at the expense of query time. Blight (Marchet et al., 2021) is another associative dictionary based on minimal perfect hashing. All the super-kmers having the same minimizer are grouped together into an index partition and a separate MPHF is built for all the k-mers in the partition. Since the k-mers' offsets are relative to a given partition, the space usage is improved compared to Pufferfish. To further reduce space, a k-mer is associated to the segment of 2 b super-k-mers where it belongs to, for a given b ! 0. This reduces the space of the dictionary by b bits per k-mer but a lookup needs to scan (at most) 2 b super-k-mers. Very importantly, Pufferfish and Blight are also optimized for streaming membership queries.

Sparse and skew hashing
In this section, we describe our main contribution: an exact, associative and compressed dictionary data structure for k-mers, supporting fast Lookup, Access and streaming queries. From a high-level point of view, the dictionary is obtained via a careful combination of minimal perfect hashing and compact encodings. In particular, we show how two important properties of minimizers-those of being sparse (Section 4.1) and skewly distributed (Section 4.2) in DNA strings-can be exploited to achieve an efficient dictionary. We aim at a good trade-off between dictionary space and query efficiency.
Recall from Section 2 that the dictionary is built from a collection of paths covering a dBG (e.g. the maximal stitched unitigs), that is: a collection of strings, each of length at least k symbols, with no duplicate k-mers. For ease of notation, we indicate with p the number of paths in the collection and with N their cumulative length (the total number of DNA bases in the input). The number of (distinct) k-mers is, therefore, n ¼ N À pðk À 1Þ.

Sparse hashing
The starting point for our development is based on the well-known empirical property of minimizers in that consecutive k-mers are likely to have the same minimizer. Thus, instead of working with individual k-mers, we focus on maximal sequences of k-mers having the same minimizer-the so-called super-k-mers (see Section 2). Superk-mers are useful because of the following two reasons.
• As super-k-mers are likely to span several consecutive k-mers, we expect to see far fewer super-k-mers than k-mers-approximately, ðk À m þ 2Þ=2 times less for a large-enough minimizer length m. Informally, this property allows a space usage proportional to the number of super-k-mers, thus sparsifying the dictionary. • A super-k-mer of length K is a space-efficient representation for its constituent K À k þ 1 k-mers since it takes 2K=ðK À k þ 1Þ bits/k-mer instead of the trivial cost of 2k bits/k-mer.
Therefore, our refined ambition is to index the super-k-mers of the input using minimizers. Although this can simply be achieved via hashing the minimizers, i.e. by concatenating all the super-k-mers having the same minimizer (like in the Blight index (Marchet et al., 2021))-we claim that his approach is very wasteful in terms of space. In fact, note that each super-k-mer has a fixed cost of 2ðk À 1Þ bits for representing the 'tail' of its string (its last kÀ1 symbols). This fixed cost is only well amortized (say, negligibly small) when the length of the super-k-mer is much larger than kÀ1. In other words, when the super-k-mer contains many more k-mers than kÀ1. While possible in some extreme cases (e.g. the same minimizer repeats in sequence), it is not usually so for the values of k and m used in concrete applications; actually, a super-k-mer is more likely to contain k À m þ 1 k-mers or less.
If z indicates the number of super-k-mers in the input, then the space of this simple solution would be, at least, 2 þ 2zðk À 1Þ=n bits/k-mer (extra space is then needed to accelerate the queries). For example, consider the whole human genome with k ¼ 31 and m ¼ 20. There are more than z ¼ 396 Â 10 6 super-k-mers for, roughly, n ¼ 2:5 Â 10 9 distinct k-mers. Therefore, partitioning the strings according to super-k-mers would cost at least 11.50 bits/kmer. As we will better see in Section 5, our dictionary can be tuned to take, overall, 8.28 bits/k-mer in this case (or less).
Thus, it is of utmost importance to not break the strings according to super-k-mers if space-efficiency is a concern. Instead, we identify a super-k-mer in the strings, whose total length is N, with an absolute offset of d log 2 ðNÞe bits. To be precise, an offset is the position in ½0; NÞ of the first base of a super-k-mer. Since k should be chosen large enough to allow good k-mer specificity, 2ðk À 1Þ will be much larger than d log 2 ðNÞe in practice, even for the largest genomes. For example, we use k ¼ 31 in our experiments, as done in many other works (Almodaresi et al., 2018;Bingmann et al., 2019;Marchet et al., 2021;Rahman and Medvedev, 2020), whereas d log 2 ðNÞe is around 30-33 for collections with billions of k-mers (see also Table 2 at page 6). The use of absolute offsets can almost halve the space overhead for the indexing of super-k-mers in such cases. The space saving is even larger for larger k.
Dictionary layout and compression. Based on the above discussion, we now detail the different components of our dictionary data structure.
• Strings. The p strings in the input are written one after the other in a vector of 2N bits (2 bits per input base). We also keep in a sorted integer sequence of length p the endpoints of the strings to avoid detection of alien k-mers. This sequence, Endpoints, is compressed with Elias-Fano and takes pd log 2 ðN=pÞe þ 2p þ oðpÞ bits. • Minimizers. Let M be the set of all distinct minimizers seen in the input, with M ¼ jMj, and z the number of super-k-mers. Clearly, we have z ! M because a minimizer can appear more than once in the input. Given a minimizer r, let us call the bucket of the minimizer r, B r , the set of all the super-k-mers that have minimizer r. We build an MPHF f for M. The MPHF provides us an addressable space of size M: for a minimizer r, the value f ðrÞ 2 ½0; MÞ is the 'bucket identifier' of r. We keep an array Sizes½0; M þ 1Þ, where Sizes½f ðrÞ þ 1 ¼ jB r j is the size of the bucket of r, and Sizes½0 ¼ 0. We then take the prefix-sums of Sizes, i.e. we replace Sizes½i with Sizes½i þ Sizes½i À 1 for all i > 0. Therefore, for a given minimizer r, now Sizes½f ðrÞ indicates that there are Sizes½f ðrÞ super-k-mers before bucket B r in the order given by f. The MPHF costs roughly 3 bits per minimizer; the Sizes array is compressed with Elias-Fano too and takes ðM þ 1Þd log 2 ðz=ðM þ 1ÞÞe þ 2ðM þ 1Þ þ oðM þ 1Þ bits. • Offsets. The absolute offsets of the super-k-mers into the strings are stored in an array, Offsets½0; zÞ, in the order given by f. For a minimizer r such that Sizes½f ðrÞ ¼ begin, its jB r j offsets are written consecutively (and in sorted order) in Offsets½begin; begin þ jB r jÞ. Note that, by construction, Sizes½f ðrÞ þ 1 À Sizes½f ðrÞ ¼ jB r j > 0. The space for the Offsets array is zd log 2 ðNÞe bits. Figure 1 illustrates the different components of the dictionary and provides a concrete example for an input collection of four strings. Next, we describe how the Lookup and Access queries are supported.
Lookup. We first recall that the Lookup query takes as input a k-mer g and returns a unique identifier i for g: i 2 ½0; nÞ if g is found in the dictionary, or i ¼ -1 otherwise. The Lookup algorithm is as follows.
We compute the minimizer r of g and its bucket identifier as f(r). Then, we locate the super-k-mers in its bucket B r by retrieving the corresponding offsets from Offsets½begin; endÞ, where begin ¼ Sizes½f ðrÞ and end ¼ Sizes½f ðrÞ þ 1. For every offset t in Offsets½begin; endÞ, we scan the super-k-mer starting from Strings½t comparing its k-mers to the query g. If g is not found, we just return -1. Instead, if g is found in position w in the super-k-mer, we return the 'identifier' i of g as i ¼ t þ w À jðk À 1Þ, where j < p is the number of strings before the one containing the offset t (this quantity is computed from the Endpoints array).
Refer to Figure 1 for an example of Lookup and to the Supplementary Material for further technical details.
Double strandedness. A detail of crucial importance for the Lookup algorithm is double strandedness. A k-mer and its reverse complement are considered to be identical. This means that if a kmer g is not found by the Lookup algorithm, there can still be the possibility for its reverse complementĝ to be found in Strings. Therefore, the actual Lookup routine will first search for g andonly if not found-will also search forĝ. This effectively doubles the query time for Lookup in the worst case.
To guarantee that a Lookup will always inspect one single bucket, we use a different minimizer computation (during both query and dictionary construction): we select as minimizer the minimum between the minimizer of g and that ofĝ. In this way, it is guaranteed that two k-mers being the reverse complements of each other always belong to the same bucket.
This different minimizer selection actually changes the parsing of super-k-mers from the input during the construction of the dictionary. We refer to this parsing modality as canonical henceforth, in contrast to the regular modality we assumed so far. When this modality is chosen, we expect to see an increase in the number of distinct minimizers used (on average, the minimizers of g andĝ have equal probability of being the minimum one) for a higher space usage, but faster query time.
We will explore the space/time trade-off between the regular and canonical modalities in Section 5.1.
Access. The Access query retrieves the k-mer string g given its identifier i. This identifier represents the rank of g in Strings but, since the strings have variable lengths and their last kÀ1 symbols do not correspond to valid ranks, we cannot directly access Strings at position i. Instead, we have to compute the offset t corresponding to the k-mer g of rank i. Therefore, we perform a binary search for i in Endpoints to determine t and return g ¼ Strings½t; t þ kÞ.
Lastly, in this section, we point out that the minimizer length m controls a space/time trade-off for the proposed dictionary data structure. Small values of m create fewer and longer super-k-mers, thus lowering the space for the smaller values of z and M. On the other hand, m should not be chosen too small to avoid the scan of many super-k-mers at query time. We will experimentally show the trade-off in Section 5.1. Next, we take a deeper look at lookup time.

Skew hashing
The efficiency of the Lookup query depends on the number of superk-mers in the bucket of a minimizer, which we refer to as the 'size' of the bucket. Since a minimizer can appear multiple times in the input strings, nothing prevents its bucket size to grow unbounded. For example, on the human genome, the largest bucket size can be as large as 3:6 Â 10 4 for m ¼ 20 (or even larger for smaller values of m), meaning that a query inspecting such a bucket would be very slow in practice.
To avoid the burden of these heavy buckets, i.e. to guarantee that a Lookup inspects a constant number of super-k-mers in the worst case, we exploit another important property of minimizers: the distribution of the bucket size is (very) skewed for sufficiently large m. That is, most minimizers appear just once and relatively few of them repeat many times-an observation also made in several previous works (see, e.g. Chikhi et al., 2014;Jain et al., 2020). Table 1 shows an example of such distribution for the first n ¼ 10 9 k-mers (for k ¼ 31) of the human genome. Similar values were obtained for other genomes. (See the tables for other values of n in Supplementary Material.) More precisely, a value in the table represents the fraction of buckets having size s, for s ¼ 1; 2; 3; 4; 5 (only the first 5 sizes are shown for conciseness). The important thing to observe in the example is that, for m > 17, the distribution is very  Fig. 1. A schematic representation of the proposed dictionary data structure. The input of the example contains p ¼ 4 strings (pictorially separated by a '.' symbol, but practically by the Endpoints array) for a total of N ¼ 405 bases, and N À pðk À 1Þ ¼ 405 À 4 Á ð31 À 1Þ ¼ 285 k-mers for k ¼ 31. There are M ¼ 24 minimizers for m ¼ 8 and z ¼ 28 super-k-mers, thus the Sizes and Offsets arrays have length, respectively, M þ 1 ¼ 25 and z ¼ 28. All minimizers have bucket size equal to 1 except for 3 of them (i.e. AACCTGAA, ATCCTGAA, TGTCAAAG) that have bucket size equal to 2. The picture also shows an example of Lookup for the k-mer g ¼ ACATCCTGAAAATTGTCAAAGAATGGCGGCG, whose minimizer r ¼ ATCCTGAA is highlighted in bold font. The flow of the algorithm is represented by the arrows. First, the function f returns the identifier of r as f(r) ¼ 5. Then the bucket size of r is computed: in this case, we have jBrj ¼ end À begin ¼ Sizes½5 þ 1 À Sizes½5 ¼ 9 À 7 ¼ 2, indicating that there are 2 super-k-mers to consider. The offsets of the super-k-mers are retrieved as Offsets½begin ¼ Offsets½7 ¼ 9 and Offsets½begin þ 1 ¼ Offsets½8 ¼ 255. The two super-k-mers are scanned in Strings starting at Strings½9 and Strings½255, respectively. (At most k À m þ 1 ¼ 31 À 8 þ 1 ¼ 24 k-mers are considered in each super-k-mer, as highlighted by the gray box. See the Supplementary Material for a discussion about this point.) Lastly, the k-mer g is found at position w ¼ 8 in the second super-kmer, i.e. at offset t þ w ¼ 255 þ 8 ¼ 263. Since there are two strings before the one containing g, then j ¼ 2, and we have to discard jðk À 1Þ ¼ 2 Á ð31 À 1Þ ¼ 60 invalid ranks for the calculation of the identifier i of g. Therefore, we return i ¼ t þ w À jðk À 1Þ ¼ 263 À 60 ¼ 203 skewed, e.g. most buckets (> 90%) contain just 1 super-k-mer. As we want to take advantage of this distribution, we proceed as follows.
We fix two quantities ' and L, with 0 ' < L. By virtue of the skew distribution, we have that the number of buckets whose size is larger than 2 ' is small for a proper choice of ', as well as the number of k-mers belonging to such buckets. For example with m ¼ 20 and ' ¼ 2, we see from Table 1 that we have 100:0 À ð97:1 þ 1:7 þ 0:4 þ 0:2Þ% ¼ 0:6% of buckets with more than 2 '¼2 ¼ 4 super-kmers. This allows us to build an MPHF to speed up Lookup but only for a small fraction of the total k-mers-in marked contrast with prior schemes (reviewed in Section 3) that build the function over the entire set of k-mers. For ease of exposition, in the following we assume that 2 L max, where max is the largest bucket size (the corner case for 2 ' max < 2 L is straightforward to handle). For ' i L, let S i be the set of all the k-mers belonging to buckets of size s, with s such that 2 i < s 2 iþ1 ' i < L 2 L < s max i ¼ L : We build an MPHF f i for each set S i . Now, given a k-mer g 2 S i , we know that it belongs to a bucket containing at most 2 iþ1 super-kmers. Therefore, we can store the identifier of the super-k-mer containing g in a vector, V i ½0; jS i jÞ, at position f i ðgÞ. Importantly, each integer in V i requires just i þ 1 bits to be represented (V L is formed by d log 2 maxe-bit integers). Figure 2a illustrates this structure.
We point out that, again thanks to the skew distribution, it is very likely to have jS ' j ! jS 'þ1 j ! . . . ! jS LÀ1 j. Therefore, for a proper choice of ' and L, we expect this additional skew index component of the dictionary to take little space, while granting very fast searches.
To make a concrete example, let us consider the human genome and the skew index with ' ¼ 6 and L ¼ 12. So we form L À ' þ 1 ¼ 12 À 6 þ 1 ¼ 7 partitions; each partition is made up of an MPHF and a compact vector. Each MPHF f i can be tuned to take 2:5 À 3:0 bits per key, whereas we spend i þ 1 bits per integer in V i , i ¼ 6; . . . ; 11. (As already mentioned, max ¼ 3:6 Â 10 4 for m ¼ 20, thus we spend d log 2 maxe ¼ 16 bits per integer in V L¼12 .) The crucial point is that we have 0.016% of buckets that comprise more than 2 '¼6 super-k-mers, for just 1.86% of the total k-mers. For this reason, the skew index costs less than 0.21 bits/k-mer over a total of 8.28 bits/k-mer (see also Supplementary Fig. S1a).
Accelerated lookup. Using the skew index to accelerate LookupðgÞ is simple. As for regular Lookup, we compute the minimizer r of g and the quantities begin ¼ Sizes½f ðrÞ and end ¼ Sizes½f ðrÞ þ 1. Therefore, we know that the bucket of r has size end À begin max. Let b ¼ d log 2 ðend À beginÞe. If b ', then the bucket is 'small' and we proceed as already explained in Section 4.1. Otherwise, we know that g, if present in the dictionary, belongs to some partition i of the skew index that, as per our description above, has MPHF f i and compact vector V i . Thus, we retrieve the super-k-mer identifier q ¼ V i ½f i ðgÞ and finally search for g in the super-k-mer whose offset is Offsets½begin þ q. (Note that if q ! end, then g cannot belong to the dictionary.) In conclusion, although the skew index performs 2 additional accesses per Lookup, one for f i and one for V i , it limits the number of accesses made to Strings to 2 ' . (To handle reverse complements, we may have to repeat the process also for the reverse complement of g.) Figure 2b shows a concrete example of Lookup.

Streaming queries
The Lookup algorithm we have described in the previous section is context-less, i.e. it does not take advantage of the specific, consecutive, query order issued by sequence analysis tasks. As already mentioned in Section 1, given a string P of length jPj ! k, we are interested in determining the result of Lookup for all the k-mers read consecutively from P. We would like to do it faster than just performing jPj À k þ 1 independent lookups. Therefore, in this section, we describe some important optimizations for streaming lookup queries that work well with the proposed dictionary data structure. The general idea is to cache some extra information about the result for the k-mer g ¼ P½i; i þ kÞ to speed up the computation for the next k-mer in P, say g nx ¼ P½i þ 1; i þ k þ 1Þ. The algorithm keeps track of the minimizer r of g and the position j at which the last match was found in Strings, i.e. if g belongs to the dictionary, then it is located at Strings½j; j þ kÞ for some j. These two variables make up a state information that is updated during the execution of the algorithm. Given that consecutive kmers are likely to share the same minimizers, we compare r to the minimizer of g nx , say r nx .
• If r nx ¼ r, then we know that g nx belongs to the same bucket B r of g, thus we avoid recomputing f and spare the accesses to both Sizes and Offsets. Also, if g was actually found in the dictionary (therefore, starting at Strings½j) good chances are that g nx is found at Strings½j þ 1. If so, we refer to the latter matching case as an extension. Intuitively, if the algorithm 'extends' frequently, i.e. most matches in P are determined by just looking at consecutive k-mers in Strings, then fast evaluation is retained. If the algorithm does not extend from g to g nx , i.e. g nx is not found at Strings½j þ 1, then we scan the bucket B r . Therefore, if present in (a) ( b) Fig. 2. A schematic view of the skew index component of the dictionary (a), comprising partitions ' i L, each consisting of an MPHF f i and a compact vector V i . Let us consider an example (b) of Lookup for g ¼ GAACCTGAAAACATCCTGAAAATTGTCAAAG and ' ¼ 3. Suppose that the bucket for the minimizer r ¼ GAAAACAT contains s ¼ 13 super-k-mers (whose offsets are Offsets½8475; 8488Þ in the picture), thus it belongs to partition i ¼ 3 because 2 3 < 13 2 3þ1 . (Each integer in V 3 is less than 2 3þ1 , so it can be coded in i þ 1 ¼ log 2 ð2 3þ1 Þ ¼ 4 bits.) Now, also suppose that g is located in the 9-th super-k-mer of the bucket (i.e. that of index 8). It would then be time-consuming to fully scan the 8 super-k-mers before the 9-th. Therefore, we retrieve 8-the index of the 9-th super-k-mer where g is located-from V3½f3ðgÞ and know that g has to be searched for in the super-k-mer whose offset is Offsets½8475 þ 8 the dictionary, g nx will be found at some other position j nx . So we update the state by setting j ¼ j nx . • If r nx 6 ¼ r, then we proceed as for a regular Lookup query, locating the new bucket B rnx and searching for g nx . We then set r ¼ r nx .
Of course it can happen that the minimizer r does not belong to the set of minimizers indexed by the dictionary. Recall from Section 4.1 that we build the MPHF f for the set M of all the distinct minimizers in the input. In this case, we are sure that any k-mer g whose minimizer r 6 2 M is not to be found in the dictionary. By definition, however, we are not able to detect if r 6 2 M using the MPHF f. That is, f will still locate a bucket and all the k-mers in the bucket will have (the same) minimizer, different from r. Therefore, when searching for g, we first compare r with the minimizer of the first k-mer read in the bucket: if they are different, we know that r 6 2 M and g does not belong to the dictionary. In the case when r 6 2 M, the algorithm still caches the last seen minimizer because if r nx ¼ r then also r nx 6 2 M and g nx cannot belong to the dictionary.
In conclusion-as long as the minimizer is the same-either the algorithm works locally in the same bucket, or safely skips the search.
Another convenient information to cache in the state of the algorithm is the orientation of the last match, i.e. whether the last queried k-mer g was found in the dictionary as g or as its reverse complementĝ. In fact, if g was found as g then also g nx is likely to be found as g nx and extension should be tried in forward direction (say, from lower to higher offsets in Strings). But if g was found aŝ g, then is more efficient to try to extend the matching in backward direction, hence effectively iterating backwards in Strings. In fact, suppose that the whole string P (for ease of exposition) is present in Strings but in its reverse complement form. Then the first k-mer g of P will be found asĝ in last position in the located 'region' of Strings, say at some position j. Any other attempt to extend the matching in forward direction (from j to j þ 1) will then fail and any subsequent g nx will be searched for by re-scanning the bucket again. That is, we end up in scanning the bucket for c ¼ jPj À k þ 1 times, for at least Oðc 2 Þ k-mer comparisons. To prevent this quadratic behavior in case of reverse complemented patterns, we try to directly extend the matching for g nx moving from j to jÀ1.

Experiments
In this section, we benchmark the proposed dictionary data structure-which we refer to as SSHash in the following-and compare it against the indexes reviewed in Section 3. For all our experiments, we fix k to 31.
Our implementation of SSHash is written in Cþþ17 and available at https://github.com/jermp/sshash. For the experiments we report here, the code was compiled with gcc 11.2.0 under Ubuntu 19.10 (Linux kernel 5.3.0, 64 bits), using the flags -O3 and -march¼native. We do not explicitly use any SIMD instruction in our codebase.
We use a server machine equipped with an Intel i9-9940X processor (clocked at 3.30 GHz) and 128 GB of RAM. The reported timings were collected using a single core of the processor. All dictionaries were fully loaded in internal memory before running the experiments. The SSHash dictionaries were also built entirely in internal memory.
Datasets. We downloaded some DNA collections (in.fasta format) and built the compacted dBG using the tool BCALM (v2) (Chikhi et al., 2016), without any k-mer filtering, to extract the maximal unitigs. We then run the tool UST (Rahman and Medvedev, 2020) to compute the corresponding path covers. Table 2 reports the basic statistics of the path covers. In particular we used: the whole genomes of the atlantic cod (Gadus morhua) and the common kestrel (Falco tinnunculus), the whole GRCh38 human genome (Homo sapiens), and a collection of more than 8000 bacterial genomes from Almodaresi et al. (2018).
At the code repository https://github.com/jermp/sshash we provide further instructions on how to download and prepare the datasets for indexing.

Tuning
Before comparing SSHash against other dictionaries, we first benchmark SSHash in isolation to fix a suitable choice for m and quantify the impact of the different parsing modalities (regular vs. canonical) that we introduced in Section 4.1. Following our discussion in Section 4.2, we use ' ¼ 6 and L ¼ 12 for all SSHash dictionaries.
To measure query time, we use 10 6 queries and report the mean between 5 measurements. For positive lookups, i.e. those for k-mers present in the dictionary, we sampled uniformly at random 10 6 kmers from each collection and use them as queries. Very importantly, 50% of them were transformed into their reverse complements to make sure we benchmark the dictionaries in the most general case. For negative lookups, we simply use randomly generated k-mer strings. For Access, we generated 10 6 integers uniformly at random in the range ½0; nÞ for each collection and extract the corresponding k-mer strings.
i.e. by n and p: Access is on average 2 À 3Â faster than Lookup since the wanted string is accessed directly, rather than searched for in the dictionary. Iterating thorough all k-mers in the dictionary is very fast and even independent from n: on average, it costs 20-22 ns/kmer. Therefore, for the rest of this section, we entirely focus on lookup time.
Space and lookup time. With the help of Supplementary Table S1 (see also the example at page 5 for n ¼ 10 9 ), we choose some suitable ranges of m for the different dataset sizes. The space/time trade-off by varying m in such ranges, for both regular and canonical parsing modalities, is shown in Table 3. As we discussed in Section 4.1 and apparent from the table, m controls a trade-off between dictionary size and lookup time: the smaller the m value, the more compact the dictionary, but the slower the dictionary as well (and vice versa). While it is difficult to precisely tell by how much the space will grow when moving from m to m þ 1, we see that the space grows by %0:3 À 0:4 bits/k-mer, for both regular and canonical parsing. The canonical parsing modality costs %1:0 À 1:5 bits/kmer more than the regular one for the same value of m because more distinct minimizers are used. However, the canonical version improves lookup time significantly (especially for negative queries), by a factor of 1:4 À 2:0Â on average, because only one bucket per query is inspected in the worst case rather than two by the regular modality.
Since we seek for a good balance between dictionary space and lookup time, in the light of the results reported in Table 3, we choose m as follows. For Cod and Kestrel: m ¼ 17 with regular parsing; m ¼ 16 with canonical parsing; for Human and Bacterial: m ¼ 20 with regular parsing; m ¼ 19 with canonical parsing. In the following, we assume these values of m are used and omit the indication from the tables.
In general, we observe that a good value for m satisfies 4 m > N, i.e. m should be chosen as to have-at least-as many possible minimizers as the number of bases in the input. It is therefore recommended to use m ¼ d log 4 ðNÞe þ 1 or m ¼ d log 4 ðNÞe þ 2.

Comparison against other dictionaries
In this section, we compare SSHash against the following state-ofthe-art dictionaries that we briefly reviewed in Section 3: • dBG-FM (Chikhi et al., 2014)-An implementation of the popular FM-index tailored for DNA. This implementation is widely used as an exact membership data structure for k-mers (Chikhi et al., 2014;Rahman and Medvedev, 2020), also in the ABySS assembler (Simpson et al., 2009;Jackman et al., 2017). The implementation has a main trade-off parameter (a sampling factor) that we vary as s ¼ 32, 64, 128. • Pufferfish (Almodaresi et al., 2018)-We test both the dense and sparse versions of the index. The sparse version was obtained with parameters s ¼ 9 and e ¼ 4 as used in the original paper. • Blight (Marchet et al., 2021)-We test the index with sampling rate b ¼ 0, 2, 4 and minimizer length m ¼ 10 as suggested in the paper. We recall that a sampling rate of b > 0 reduces the index space by b bits/k-mer at the expense of query time.
We use the Cþþ implementations from the respective authors: links to the various GitHub libraries are provided in the References. All sources were compiled using the same compilation flags as used for SSHash. Space and lookup time. We first consider the space of the dictionaries, reported in Table 4. The space of SSHash is significantly better than that of the other approaches based on minimal perfect hashing, roughly: 2 À 2:5Â (or more) better than Blight, and 3 À 5Â better than Pufferfish. This is so primarily because these approaches build an MPHF for the entire set of k-mers, hence associate a positional information (e.g. in the reference genome) to each k-mer in the input. We point out that, unlike for Blight, this is expected for Pufferfish dense since it was exactly designed for the purpose of reference mapping. (The shaded rows in Table 4 account for the space needed by Pufferfish to only support Lookup, i.e. discarding the color information in its colored dBG structure.) The dBG-FM index is, not surprisingly, the most compact, thanks to the compression of the powerful Burrows-Wheeler transform (BWT) (Burrows and Wheeler, 1994). (We were unable to build the index correctly on the larger Human and Bacterial datasets.) While dBG-FM is several times smaller than Pufferfish and Blight, note that its smallest version tested (for s ¼ 128) is only essentially 2Â smaller than regular SSHash and this gap diminishes at higher sampling rates. For example, dBG-FM for s ¼ 32 is only 13 À 17% smaller than regular SSHash. However, SSHash answers lookup queries much faster than dBG-FM as shown in Table 5. A lookup query in the dBG-FM index is implemented as a classic count query on a FM-index (see the paper by Ferragina and Manzini (2000) for details) which, for a pattern of length k, generates at least k cachemisses. This cost is even higher for the handling of reverse complements that may induce two distinct count queries.
We observe that SSHash regular is as fast as (or faster than) the fastest Blight's version, for b ¼ 0, and faster for higher b. SSHash canonical is always much faster than Blight. Pufferfish dense is instead faster than SSHash thanks to its simpler lookup procedure that just needs to retrieve the absolute offset of a k-mer using hashing and check the k-mer against the reference string. However, we point out that: (i) this higher Lookup efficiency comes at a significant penalty in space effectiveness compared to SSHash, and (ii) the sparse variant's performance degrades on larger datasets.
Streaming membership query time. We now consider streaming membership queries. Pufferfish and Blight are also optimized to answer these kind of stateful queries. To query the dictionaries, we use some reads (in.fastq format) downloaded from the European Nucleotide Archive, and related to each dataset-Cod: run accession SRR12858649 with 2 041 092 reads, each of length 110 bases; Kestrel: run accession SRR11449743 with 14 647 106 reads, each of length 125 bases; Human: run accession SRR5833294 with 34 129 891 reads, each of length 76 bases; Bacterial: run accession SRR5901135 with 4 628 576 reads of variable length (a sequencing run of Escherichia Coli).
We lookup for every k-mer read in sequence from the query files. For all the indexes, we just count the number of returned results rather than saving them to a vector. The result is reported in Table 6.
In general terms, we see that SSHash is either comparable to or faster (by 2 À 3Â) than Pufferfish and Blight. This holds true for both high-hit workloads (> 70% hits, i.e. k-mers present in the dictionary) and low-hit workloads ( < 1% hits). It is important to benchmark the dictionaries under these two different query scenarios as both situations are meaningful in practice. (In our experiments, low-hit workloads are obtained by querying the dictionaries using a different query file as indicated in Table 6.) Indeed, observe that while Pufferfish's performance is robust under both scenarios, Blight's query time significantly degrades when most queries are Note: The query time is reported as total time in minutes (tot), and average ns/k-mer (avg). We also indicate the query file (SRR number) and the percentage of hits. Both high-hit (> 70% hits) and low-hit ( < 1% hits) workloads are considered. negative, especially for b > 0. Also regular SSHash is almost 2Â slower for low-hit workloads compared to high-hit workloads. This is expected, however, because almost all queries are exhaustively inspecting two buckets per k-mer as we explained in Section 4.1. Note that its performance is anyway better than Blight's and not much worse than Pufferfish's (dense variant). The canonical version of SSHash protects against this behavior in case of low-hit workload and, in fact, is generally the fastest dictionary.
Another meaningful point to mention is that SSHash does not allocate extra memory at query time, i.e. only the memory of the index-as reported in Table 4-is retained (the memory for the state information maintained by the streaming algorithm described in Section 4.3 is constant). Pufferfish also does not allocate extra memory. Blight, instead, consumes more memory at query time than that required by its index layout on disk. For example, to perform the queries on the Human dataset in Table 6a, Blight with b ¼ 0 uses a maximum resident set size of 7.51 GB compared to the 6.32 GB taken by its index on disk (23.98 versus 20.17 bits/k-mer). This effect is even accentuated for higher b values.
Construction time. Table 7 reports the time and internal memory used to build the dictionaries. The dBG-FM index needs to build the BWT of the input prior to indexing. This step can be very time consuming for large collections such as the ones of practical interest. That is, another important advantage of schemes based on hashing compared to BWT-based indexes is that they require significantly less time to build. This is evident from the result reported in the table.
Due to space constraints, we do not describe the SSHash's construction algorithm in this article. We just point out that the construction is efficient; indeed SSHash took much less time to build on the test collections compared to both Blight and Pufferfish. Its memory usage is comparable to that of Pufferfish, while Blight scales better in this regard (e.g. by retaining 2Â less internal memory on Human and Bacterial) as it partially uses external memory. In future work, we will adapt the SSHash dictionary construction to use external memory too.
We also note that the SSHash canonical takes consistently more time and space to build than the regular variant: this is a direct consequence of the denser sampling of minimizers.

Conclusions and future work
We have studied the compressed dictionary problem for k-mers and proposed a solution, SSHash, based on a careful orchestration of minimal perfect hashing and compact encodings. In particular, SSHash is an exact and associative k-mer dictionary designed to deliver good practical performance. From a technical perspective, SSHash exploits the sparseness and the skew distribution of k-mer minimizers to achieve compact space, while allowing fast lookup queries.
We tested SSHash on collections of billions of k-mers and compared it against other indexes, under different query workloads (high-versus low-hit) and modalities (random versus streaming). Our implementation of SSHash is written in Cþþ and open source.
Compared to BWT-based indexes (like the dBG-FM index), SSHash is more than one order of magnitudes faster at lookup for only 2Â larger space on average. Compared to prior schemes based on minimal perfect hashing (like Pufferfish and Blight), SSHash is significantly more compact (2 À 5Â depending on the configuration) without sacrificing query efficiency. Indeed, SSHash is also the fastest dictionary for streaming membership queries. For these reasons, we believe that SSHash embodies a superior space/time trade-off for the problem tackled in this work.
Several avenues for future work are possible. We mention some promising ones. First, we will engineer the dictionary construction to use multi-threading and external memory. Parallel query processing is also interesting; since SSHash is a read-only data structure, its queries are amenable to parallelism. We could also add support for other types of queries, such as navigational queries (Chikhi et al., 2014) that, given a k-mer g, ask to enumerate all the extensions of g (i.e. in both forward and backward direction) that are present in the dictionary. Another promising direction could adapt the SSHash data structure to also store the abundances of k-mers, which is a separate but related problem in the literature (Italiano et al., 2021;Shibuya et al., 2021). Based on the observation that consecutive kmers tend to have the same or very similar abundance we expect to add a small extra space to SSHash to store this information. In this article, we focused on minimizers for their simplicity and practical efficiency but one could also explore the effects of replacing the minimizers with other types of string sampling mechanisms (Loukides and Pissis, 2021;Sahlin, 2021). Lastly, we also plan to study the approximate version of the dictionary problem where it is allowed to tolerate a prescribed false positive rate. Table 7. Dictionary construction times in minutes (using a single processing thread) and peak internal memory used during construction in GB (blight's performance was the same for all values of b in the experiment)